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Abstract 

This  paper  presents  Hedge  Trimmer,  a  HEaDline 
GEneration  system  that  creates  a  headline  for  a  newspa¬ 
per  story  using  linguistically-motivated  heuristics  to 
guide  the  choice  of  a  potential  headline.  We  present 
feasibility  tests  used  to  establish  the  validity  of  an  ap¬ 
proach  that  constructs  a  headline  by  selecting  words  in 
order  from  a  story.  In  addition,  we  describe  experimen¬ 
tal  results  that  demonstrate  the  effectiveness  of  our  lin¬ 
guistically-motivated  approach  over  a  HMM-based 
model,  using  both  human  evaluation  and  automatic  met¬ 
rics  for  comparing  the  two  approaches. 

1  Introduction 

In  this  paper  we  present  Hedge  Trimmer,  a  HEaD¬ 
line  GEneration  system  that  creates  a  headline  for  a 
newspaper  story  by  removing  constituents  from  a  parse 
tree  of  the  first  sentence  until  a  length  threshold  has 
been  reached.  Linguistically-motivated  heuristics  guide 
the  choice  of  which  constituents  of  a  story  should  be 
preserved,  and  which  ones  should  be  deleted.  Our  focus 
is  on  headline  generation  for  English  newspaper  texts, 
with  an  eye  toward  the  production  of  document  surro¬ 
gates — for  cross-language  information  retrieval — and 
the  eventual  generation  of  readable  headlines  from 
speech  broadcasts. 

In  contrast  to  original  newspaper  headlines,  which 
are  often  intended  only  to  catch  the  eye,  our  approach 
produces  informative  abstracts  describing  the  main 
theme  or  event  of  the  newspaper  article.  We  claim  that 
the  construction  of  informative  abstracts  requires  access 
to  deeper  linguistic  knowledge,  in  order  to  make  sub¬ 
stantial  improvements  over  purely  statistical  ap¬ 
proaches. 

In  this  paper,  we  present  our  technique  for  produc¬ 
ing  headlines  using  a  parse-and-trim  approach  based  on 
the  BBN  Parser.  As  described  in  Miller  et  al.  (1998),  the 
BBN  parser  builds  augmented  parse  trees  according  to  a 
process  similar  to  that  described  in  Collins  (1997). 
The  BBN  parser  has  been  used  successfully  for  the  task 


of  information  extraction  in  the  SIET  system  (Miller  et 
al.,  2000). 

The  next  section  presents  previous  work  in  the  area 
of  automatic  generation  of  abstracts.  Eollowing  this,  we 
present  feasibility  tests  used  to  establish  the  validity  of 
an  approach  that  constructs  headlines  from  words  in  a 
story,  taken  in  order  and  focusing  on  the  earlier  part  of 
the  story.  Next,  we  describe  the  application  of  the 
parse-and-trim  approach  to  the  problem  of  headline 
generation.  We  discuss  the  linguistically-motivated 
heuristics  we  use  to  produce  results  that  are  headline¬ 
like.  Einally,  we  evaluate  Hedge  Trimmer  by  compar¬ 
ing  it  to  our  earlier  work  on  headline  generation,  a  prob¬ 
abilistic  model  for  automatic  headline  generation  (Zajic 
et  al,  2002).  In  this  paper  we  will  refer  to  this  statistical 
system  as  HMM  Hedge  We  demonstrate  the  effective¬ 
ness  of  our  linguistically-motivated  approach.  Hedge 
Trimmer,  over  the  probabilistic  model,  HMM  Hedge, 
using  both  human  evaluation  and  automatic  metrics. 

2  Previous  Work 

Other  researchers  have  investigated  the  topic  of 
automatic  generation  of  abstracts,  but  the  focus  has  been 
different,  e.g.,  sentence  extraction  (Edmundson,  1969; 
Johnson  et  al,  1993;  Kupiec  et  al.,  1995;  Mann  et  al., 
1992;  Teufel  and  Moens,  1997;  Zechner,  1995),  proc¬ 
essing  of  structured  templates  (Paice  and  Jones,  1993), 
sentence  compression  (Hori  et  al.,  2002;  Knight  and 
Marcu,  2001;  Grefenstette,  1998,  Luhn,  1958),  and  gen¬ 
eration  of  abstracts  from  multiple  sources  (Radev  and 
McKeown,  1998).  We  focus  instead  on  the  construction 
of  headline-style  abstracts  from  a  single  story. 

Headline  generation  can  be  viewed  as  analogous  to 
statistical  machine  translation,  where  a  concise  docu¬ 
ment  is  generated  from  a  verbose  one  using  a  Noisy 
Channel  Model  and  the  Viterbi  search  to  select  the  most 
likely  summarization.  This  approach  has  been  explored 
in  (Zajic  et  al.,  2002)  and  (Banko  et  al.,  2000). 

The  approach  we  use  in  Hedge  is  most  similar  to 
that  of  (Knight  and  Marcu,  2001),  where  a  single  sen¬ 
tence  is  shortened  using  statistical  compression.  As  in 
this  work,  we  select  headline  words  from  story  words  in 
the  order  that  they  appear  in  the  story — in  particular,  the 
first  sentence  of  the  story.  However,  we  use  linguist!- 
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cally  motivated  heuristics  for  shortening  the  sentence; 
there  is  no  statistical  model,  which  means  we  do  not 
require  any  prior  training  on  a  large  corpus  of 
story/headline  pairs. 

Linguistically  motivated  heuristics  have  been  used 
by  (McKeown  et  al,  2002)  to  distinguish  constituents  of 
parse  trees  which  can  be  removed  without  affecting 
grammaticality  or  correctness.  GLEANS  (Daume  et  al, 
2002)  uses  parsing  and  named  entity  tagging  to  fill  val¬ 
ues  in  headline  templates. 

Consider  the  following  excerpt  from  a  news  story: 

(1)  Story  Words:  Kurdish  guerilla  forces  moving 
with  lightning  speed  poured  into  Kirkuk  today 
immediately  after  Iraqi  troops,  fleeing  relent¬ 
less  U.S.  airstrikes,  abandoned  the  hub  of  Iraq’s 
rich  northern  oil  fields. 

Generated  Headline:  Kurdish  guerilla  forces 
poured  into  Kirkuk  after  Iraqi  troops  abandoned 
oil  fields. 

In  this  case,  the  words  in  bold  form  a  fluent  and 
accurate  headline  for  the  story.  Italicized  words  are 
deleted  based  on  information  provided  in  a  parse-tree 
representation  of  the  sentence. 

3  Feasibility  Testing 

Our  approach  is  based  on  the  selection  of  words 
from  the  original  story,  in  the  order  that  they  appear  in 
the  story,  and  allowing  for  morphological  variation.  To 
determine  the  feasibility  of  our  headline-generation  ap¬ 
proach,  we  first  attempted  to  apply  our  “select-words- 
in-order”  technique  by  hand.  We  asked  two  subjects  to 
write  headline  headlines  for  73  AP  stories  from  the 
TIPSTER  corpus  for  January  1,  1989,  by  selecting 
words  in  order  from  the  story.  Of  the  146  headlines,  2 
did  not  meet  the  “select-words-in-order”  criteria  be¬ 
cause  of  accidental  word  reordering.  We  found  that  at 
least  one  fluent  and  accurate  headline  meeting  the  crite¬ 
ria  was  created  for  each  of  the  stories.  The  average 
length  of  the  headlines  was  10.76  words. 

Later  we  examined  the  distribution  of  the  headline 
words  among  the  sentences  of  the  stories,  i.e.  how  many 
came  from  the  first  sentence  of  a  story,  how  many  from 
the  second  sentence,  etc.  The  results  of  this  study  are 
shown  in  Eigure  1.  We  observe  that  86.8%  of  the  head¬ 
line  words  were  chosen  from  the  first  sentence  of  their 
stories.  We  performed  a  subsequent  study  in  which  two 
subjects  created  100  headlines  for  100  AP  stories  from 
August  6,  1990.  51.4%  of  the  headline  words  in  the 
second  set  were  chosen  from  the  first  sentence.  The 
distribution  of  headline  words  for  the  second  set  shown 
in  Eigure  2. 


Although  humans  do  not  always  select  headline 
words  from  the  first  sentence,  we  observe  that  a  large 
percentage  of  headline  words  are  often  found  in  the  first 
sentence. 


Eigure  1 :  Percentage  of  words  from  human-generated 
headlines  drawn  from  Nth  sentence  of  story  (Set  1) 


Figure  2:  Percentage  of  words  from  human-generated  head¬ 
lines  drawn  from  Nth  sentence  of  story  (Set  2) 


4  Approach 

The  input  to  Hedge  is  a  story,  whose  first  sentence  is 
immediately  passed  through  the  BBN  parser.  The 
parse-tree  result  serves  as  input  to  a  linguistically- 
motivated  module  that  selects  story  words  to  form  head¬ 
lines  based  on  key  insights  gained  from  our  observa¬ 
tions  of  human-constructed  headlines.  That  is,  we 
conducted  a  human  inspection  of  the  73  TIPSTER  sto¬ 
ries  mentioned  in  Section  3  for  the  purpose  of  develop¬ 
ing  the  Hedge  Trimmer  algorithm. 

Based  on  our  observations  of  human-produced 
headlines,  we  developed  the  following  algorithm  for 
parse-tree  trimming: 

1 .  Choose  lowest  leftmost  S  with  NP, VP 

2.  Remove  low  content  units 


o  some  determiners 
o  time  expressions 

3.  Iterative  shortening: 

o  XP  Reduction 
o  Remove  preposed  adjuncts 
o  Remove  trailing  PPs 

o  Remove  trailing  SBARs 

More  recently,  we  conducted  an  automatic  analysis 
of  the  human-generated  headlines  that  supports  several 
of  the  insights  gleaned  from  this  initial  study.  We  parsed 
218  human-produced  headlines  using  the  BBN  parser 
and  analyzed  the  results.  For  this  analysis,  we  used  72 
headlines  produced  by  a  third  participant.*  The  parsing 
results  included  957  noun  phrases  (NP)  and  315  clauses 
(S). 

We  calculated  percentages  based  on  headline-level, 
NP-level,  and  Sentence-level  structures  in  the  parsing 
results.  That  is,  we  counted: 

•  The  percentage  of  the  957  NPs  containing  de¬ 
terminers  and  relative  clauses 

•  The  percentage  of  the  218  headlines  containing 
preposed  adjuncts  and  conjoined  S  or  VPs 

•  The  percentage  of  the  315  S  nodes  containing 
trailing  time  expressions,  SBARs,  and  PPs 

Figure  3  summarizes  the  results  of  this  automatic  analy¬ 
sis.  In  our  initial  human  inspection,  we  considered  each 
of  these  categories  to  be  reasonable  candidates  for  dele¬ 
tion  in  our  parse  tree  and  this  automatic  analysis  indi¬ 
cates  that  we  have  made  reasonable  choices  for  deletion, 
with  the  possible  exception  of  trailing  PPs,  which  show 
up  in  over  half  of  the  human-generated  headlines.  This 
suggests  that  we  should  proceed  with  caution  with  re¬ 
spect  to  the  deletion  of  trailing  PPs;  thus  we  consider 
this  to  be  an  option  only  if  no  other  is  available. 


HEADLINE-LEVEL  PERCENTAGES 

preposed  adjuncts  =  0/218  (0%) 
conjoined  S  =  1/218  ( .5%) 

conjoined  VP  =  7/218  (3%) _ 

NP-LEVEL  PERCENTAGES 
relative  clauses  =  3/957  (.3%) 
determiners  =  31/957  (3%);  of  these,  only 
16  were  “a”  or  “the”  (1.6%  overall) 
S-LEVEL  PERCENTAGES^ 
time  expressions  =  5/315  (1.5%) 
trailing  PPs  =  165/315  (52%) 

trailing  SBARs  =  24/315  (8%) _ 

Figure  3:  Percentages  found  in  human-generated  headlines 


’  No  response  was  given  for  one  of  the  73  stories. 

^  Trailing  constituents  (SBARs  and  PPs)  are  computed  by 
counting  the  number  of  SBARs  (or  PPs)  not  designated  as  an 
argument  of  (contained  in)  a  verb  phrase. 


For  a  comparison,  we  conducted  a  second  analysis 
in  which  we  used  the  same  parser  on  just  the  first  sen¬ 
tence  of  each  of  the  73  stories.  In  this  second  analysis, 
the  parsing  results  included  817  noun  phrases  (NP)  and 
316  clauses  (S).  A  summary  of  these  results  is  shown  in 
Figure  4.  Note  that,  across  the  board,  the  percentages 
are  higher  in  this  analysis  than  in  the  results  shown  in 
Figure  3  (ranging  from  12%  higher — in  the  case  of  trail¬ 
ing  PPs — to  1500%  higher  in  the  case  of  time  expres¬ 
sions),  indicating  that  our  choices  of  deletion  in  the 
Hedge  Trimmer  algorithm  are  well-grounded. 


HEADLINE-LEVEL  PERCENTAGES 

preposed  adjuncts  =  2/73  (2.7%) 
conjoined  S  =  3/73  (4%) 

conjoined  VP  =  20/73  (27%) _ 

NP-LEVEL  PERCENTAGES 
relative  clauses  =  29/817  (3.5%) 
determiners  =  205/817  (25%);  of  these, 
only  171  were  “a”  or  “the”  (21%  overall) 
S-LEVEL  PERCENTAGES 
time  expressions  =  77/316  (24%) 
trailing  PPs  =  184/316  (58%) 

trailing  SBARs  =  49/316  (16%) _ 

Figure  4:  Percentages  found  in  first  sentence  of 
each  story. 

4.1  Choose  the  Correct  S  Node 

The  first  step  relies  on  what  is  referred  to  as  the 
Projection  Principle  in  linguistic  theory  (Chomsky, 
1981):  Predicates  project  a  subject  (both  dominated  by 
S)  in  the  surface  structure.  Our  human-generated  head¬ 
lines  always  conformed  to  this  rule;  thus,  we  adopted  it 
as  a  constraint  in  our  algorithm. 

An  example  of  the  application  of  step  1  above  is 
the  following,  where  boldfaced  material  from  the  parse 
tree  representation  is  retained  and  italicized  material  is 
eliminated: 

(2)  Input:  Rebels  agree  to  talks  with  government  of¬ 
ficials  said  Tuesday. 

Parse:  [S  [S  [NP  Rebels]  [VP  agree  to  talks 
with  government]]  ojficials  said  Tuesday.] 

Output  of  step  1 :  Rebels  agree  to  talks  with  gov¬ 
ernment. 

When  the  parser  produces  a  correct  tree,  this  step  pro¬ 
vides  a  grammatical  headline.  However,  the  parser  of¬ 
ten  produces  an  incorrect  output.  Human  inspection  of 
our  624-sentence  DUC-2003  evaluation  set  revealed 
that  there  were  two  such  scenarios,  illustrated  by  the 
following  cases: 


(3)  [S  [SBAR  What  started  as  a  local  contro¬ 
versy]  [VP  has  evolved  into  an  international 
scandal.]] 

(4)  [NP  [NP  Bangladesh]  [CC  and]  [NP  [NP  In¬ 
dia]  [VP  signed  a  water  sharing  accord.]]] 

In  the  first  case,  an  S  exists,  but  it  does  not  conform 
to  the  requirements  of  step  1 .  This  occurred  in  2.6%  of 
the  sentences  in  the  DUC-2003  evaluation  data.  We 
resolve  this  by  selecting  the  lowest  leftmost  S,  i.e.,  the 
entire  string  “What  started  as  a  local  controversy  has 
evolved  into  an  international  scandaf’  in  the  example 
above. 

In  the  second  case,  there  is  no  S  available.  This  oc¬ 
curred  in  3.4%  of  the  sentences  in  the  evaluation  data. 
We  resolve  this  by  selecting  the  root  of  the  parse  tree; 
this  would  be  the  entire  string  “Bangladesh  and  India 
signed  a  water  sharing  accord”  above.  No  other  parser 
errors  were  encountered  in  the  DUC-2003  evaluation 
data. 

4.2  Removal  of  Low  Content  Nodes 

Step  2  of  our  algorithm  eliminates  low-content 
units.  We  start  with  the  simplest  low-content  units:  the 
determiners  a  and  the.  Other  determiners  were  not  con¬ 
sidered  for  deletion  because  our  analysis  of  the  human- 
constructed  headlines  revealed  that  most  of  the  other 
determiners  provide  important  information,  e.g.,  nega¬ 
tion  (not),  quantifiers  (each,  many,  several),  and  deictics 
(this,  that). 

Beyond  these,  we  found  that  the  human-generated 
headlines  contained  very  few  time  expressions  which, 
although  certainly  not  content-free,  do  not  contribute 
toward  conveying  the  overall  “who/what  content”  of  the 
story.  Since  our  goal  is  to  provide  an  informative  head¬ 
line  (i.e.,  the  action  and  its  participants),  the  identifica¬ 
tion  and  elimination  of  time  expressions  provided  a 
significant  boost  in  the  performance  of  our  automatic 
headline  generator. 

We  identified  time  expressions  in  the  stories  using 
BBN’s  IdentiFinder™  (Bikel  et  al,  1999).  We  imple¬ 
mented  the  elimination  of  time  expressions  as  a  two- 
step  process: 

•  Use  IdentiFinder  to  mark  time  expressions 

•  Remove  [PP  ...  [NP  [X]  ...]  ...]  and  [NP  [X]] 
where  X  is  tagged  as  part  of  a  time  expression 

The  following  examples  illustrate  the  application  of 
this  step: 

(5)  Input:  The  State  Department  on  Friday  lifted  the 
ban  it  had  imposed  on  foreign  fliers. 


Parse:  [Det  The]  State  Department  [PP  [IN 
on]  [NP  [NNP  Friday] [[  lifted  [Det  the]  ban  it 

had  imposed  on  foreign  fliers. 

Output  of  step  2:  State  Department  lifted  ban  it 
has  imposed  on  foreign  fliers. 

(6)  Input:  An  international  relief  agency  announced 
Wednesday  that  it  is  withdrawing  from  North 
Korea. 

Parse:  [Det  An]  international  relief  agency  an¬ 
nounced  [NP  [NNP  Wednesday] [  that  it  is  with¬ 
drawing  from  North  Korea. 

Output  of  step  2:  International  relief  agency  an¬ 
nounced  that  it  is  withdrawing  from  North  Korea. 

We  found  that  53.2%  of  the  stories  we  examined 
contained  at  least  one  time  expression  which  could  be 
deleted.  Human  inspection  of  the  50  deleted  time  ex¬ 
pressions  showed  that  38  were  desirable  deletions,  10 
were  locally  undesirable  because  they  introduced  an 
ungrammatical  fragment,^  and  2  were  undesirable  be¬ 
cause  they  removed  a  potentially  relevant  constituent. 
However,  even  an  undesirable  deletion  often  pans  out 
for  two  reasons:  (1)  the  ungrammatical  fragment  is  fre¬ 
quently  deleted  later  by  some  other  rule;  and  (2)  every 
time  a  constituent  is  removed  it  makes  room  under  the 
threshold  for  some  other,  possibly  more  relevant  con¬ 
stituent.  Consider  the  following  examples. 

(7)  At  least  two  people  were  killed  Sunday. 

(8)  At  least  two  people  were  killed  when  single¬ 
engine  airplane  crashed. 

Example  (7)  was  produced  by  a  system  which  did  not 
remove  time  expressions.  Example  (8)  shows  that  if  the 
time  expression  Sunday  were  removed,  it  would  make 
room  below  the  10-word  threshold  for  another  impor¬ 
tant  piece  of  information. 

4.3  Iterative  Shortening 

The  final  step,  iterative  shortening,  removes  lin¬ 
guistically  peripheral  material — through  successive  de¬ 
letions — until  the  sentence  is  shorter  than  a  given 
threshold.  We  took  the  threshold  to  be  10  for  the  DUC 
task,  but  it  is  a  configurable  parameter.  Also,  given  that 
the  human-generated  headlines  tended  to  retain  earlier 
material  more  often  than  later  material,  much  of  our 


Two  examples  of  genuinely  undesirable  time  expression  deletion 
are: 

•  The  attack  came  on  the  heels  of  [New  Year’s  Day]. 

•  [New  Year’s  Day]  brought  a  foot  of  snow  to  the  region. 


iterative  shortening  is  focused  on  deleting  the  rightmost 
phrasal  categories  until  the  length  is  below  threshold. 

There  are  four  types  of  iterative  shortening  rules. 
The  first  type  is  a  rule  we  call  “XP-over-XP,”  which  is 
implemented  as  follows: 

In  constructions  of  the  form  [XP  [XP  . . .]  ...]  re¬ 
move  the  other  children  of  the  higher  XP,  where 
XP  is  NP,  VP  or  S. 

This  is  a  linguistic  generalization  that  allowed  us  apply 
a  single  rule  to  capture  three  different  phenomena  (rela¬ 
tive  clauses,  verb-phrase  conjunction,  and  sentential 
conjunction).  The  rule  is  applied  iteratively,  from  the 
deepest  rightmost  applicable  node  backwards,  until  the 
length  threshold  is  reached. 

The  impact  of  XP-over-XP  can  be  seen  in  these  ex¬ 
amples  of  NP-over-NP  (relative  clauses),  VP -over-VP 
(verb-phrase  conjunction),  and  S-over-S  (sentential  con¬ 
junction),  respectively: 

(9)  Input:  A  fire  killed  a  firefighter  who  was  fatally 
injured  as  he  searched  the  house. 

Parse:  [S  [Det  A]  fire  killed  [Det  a]  [NP  [NP 
firefighter]  [SBAR  who  was  fatally  injured  as 
he  searched  the  house]  ]] 

Output  of  NP-over-NP:  fire  killed  firefighter 

(10)  Input:  Illegal  fireworks  injured  hundreds  of  peo¬ 
ple  and  started  six  fires. 

Parse:  [S  Illegal  fireworks  [VP  [VP  injured 
hundreds  of  people]  [CC  and]  ]VP  started  six 
fires]  ]] 

Output  of  VP-over-VP:  Illegal  fireworks  injured 
hundreds  of  people 

(1 1)  Input:  A  company  offering  blood  cholesterol 
tests  in  grocery  stores  says  medical  technology 
has  outpaced  state  laws,  but  the  state  says  the 
company  doesn’t  have  the  proper  licenses. 

Parse:  [S  ]Det  A]  company  offering  blood  cho¬ 
lesterol  tests  in  grocery  stores  says  [S  [S  medi¬ 
cal  technology  has  outpaced  state  laws],  ]CC 

but]  ]S  ]Det  the]  state  stays  ]Det  the]  company 
doesn’t  have  ]Det  the] proper  licenses.]]  ] 

Output  of  S-over-S:  Company  offering  blood 
cholesterol  tests  in  grocery  store  says  medical 
technology  has  outpaced  state  laws 


The  second  type  of  iterative  shortening  is  the  re¬ 
moval  of  preposed  adjuncts.  The  motivation  for  this 
type  of  shortening  is  that  all  of  the  human-generated 
headlines  ignored  what  we  refer  to  as  the  preamble  of 
the  story.  Assuming  the  Projection  principle  has  been 
satisfied,  the  preamble  is  viewed  as  the  phrasal  material 
occurring  before  the  subject  of  the  sentence.  Thus,  ad¬ 
juncts  are  identified  linguistically  as  any  XP  unit  pre¬ 
ceding  the  first  NP  (the  subject)  under  the  S  chosen  by 
step  1.  This  type  of  phrasal  modifier  is  invisible  to  the 
XP-over-XP  rule,  which  deletes  material  under  a  node 
only  if  it  dominates  another  node  of  the  same  phrasal 
category. 

The  impact  of  this  type  of  shortening  can  be  seen  in 
the  following  example: 

(12)  Input:  According  to  a  now  finalized  blueprint  de¬ 
scribed  by  U.S.  officials  and  other  sources,  the 
Bush  administration  plans  to  take  complete,  unilat¬ 
eral  control  of  a  post-Saddam  Hussein  Iraq 

Parse:  [S  ]PP  According  to  a  now-finalized  blue¬ 
print  described  by  U.S.  officials  and  other  sources] 
]Det  the]  Bush  administration  plans  to  take 
complete,  unilateral  control  of  ]Det  a]  post- 
Saddam  Hussein  Iraq  ] 

Output  of  Preposed  Adjunct  Removal:  Bush  ad¬ 
ministration  plans  to  take  complete  unilateral  con¬ 
trol  of  post-Saddam  Hussein  Iraq 

The  third  and  fourth  types  of  iterative  shortening 
are  the  removal  of  trailing  PPs  and  SBARs,  respec¬ 
tively: 

•  Remove  PPs  from  deepest  rightmost  node  back¬ 
ward  until  length  is  below  threshold. 

•  Remove  SBARs  from  deepest  rightmost  node 
backward  until  length  is  below  threshold. 

These  are  the  riskiest  of  the  iterative  shortening  rules, 
as  indicated  in  our  analysis  of  the  human-generated 
headlines.  Thus,  we  apply  these  conservatively,  only 
when  there  are  no  other  categories  of  rules  to  apply. 
Moreover,  these  rules  are  applied  with  a  backoff  option 
to  avoid  over-trimming  the  parse  tree.  First  the  PP 
shortening  rule  is  applied.  If  the  threshold  has  been 
reached,  no  more  shortening  is  done.  However,  if  the 
threshold  has  not  been  reached,  the  system  reverts  to  the 
parse  tree  as  it  was  before  any  PPs  were  removed,  and 
applies  the  SBAR  shortening  rule.  If  the  threshold  still 
has  not  been  reached,  the  PP  rule  is  applied  to  the  result 
of  the  SBAR  rule. 

Other  sequences  of  shortening  rules  are  possible. 
The  one  above  was  observed  to  produce  the  best  results 
on  a  73 -sentence  development  set  of  stories  from  the 


TIPSTER  corpus.  The  intuition  is  that,  when  removing 
constituents  from  a  parse  tree,  it’s  best  to  remove 
smaller  portions  during  each  iteration,  to  avoid  produc¬ 
ing  trees  with  undesirably  few  words.  PPs  tend  to  rep¬ 
resent  small  parts  of  the  tree  while  SBARs  represent 
large  parts  of  the  tree.  Thus  we  try  to  reach  the  thresh¬ 
old  by  removing  small  constituents,  but  if  we  can’t 
reach  the  threshold  that  way,  we  restore  the  small  con¬ 
stituents,  remove  a  large  constituent  and  resume  the 
deletion  of  small  constituents. 

The  impact  of  these  two  types  of  shortening  can  be 
seen  in  the  following  examples: 

(13)  Input:  More  oil-covered  sea  birds  were  found 
over  the  weekend. 

Parse:  [S  More  oil-covered  sea  birds  were 

found  [PP  over  the  weekend]} 

Output  of  PP  Removal:  More  oil-covered  sea 
birds  were  found. 

(14)  Input:  Visiting  China  Interpol  chief  expressed 
confidence  in  Hong  Kong’s  smooth  transition 
while  assuring  closer  cooperation  after  Hong 
Kong  returns. 

Parse:  [S  Visiting  China  Interpol  chief  ex¬ 
pressed  confidence  in  Hong  Kong’s  smooth 
transition  [SBAR  while  assuring  closer  coopera¬ 
tion  after  Hong  Kong  returns]] 

Output  of  SBAR  Removal:  Visiting  China  Inter¬ 
pol  chief  expressed  confidence  in  Hong  Kong’s 
smooth  transition 


5  Evaluation 

We  conducted  two  evaluations.  One  was  an  informal 
human  assessment  and  one  was  a  formal  automatic 
evaluation. 

5.1  HMM  Hedge 

We  compared  our  current  system  to  a  statistical 
headline  generation  system  we  presented  at  the  2001 
DUC  Summarization  Workshop  (Zajic  et  ah,  2002), 
which  we  will  refer  to  as  HMM  Hedge.  HMM  Hedge 
treats  the  summarization  problem  as  analogous  to  statis¬ 
tical  machine  translation.  The  verbose  language,  arti¬ 
cles,  is  treated  as  the  result  of  a  concise  language, 
headlines,  being  transmitted  through  a  noisy  channel. 
The  result  of  the  transmission  is  that  extra  words  are 
added  and  some  morphological  variations  occur.  The 
Viterbi  algorithm  is  used  to  calculate  the  most  likely 


unseen  headline  to  have  generated  the  seen  article.  The 
Viterbi  algorithm  is  biased  to  favor  headline-like  char¬ 
acteristics  gleaned  from  observation  of  human  perform¬ 
ance  of  the  headline-construction  task.  Since  the  2002 
Workshop,  HMM  Hedge  has  been  enhanced  by  incorpo¬ 
rating  part  of  speech  of  information  into  the  decoding 
process,  rejecting  headlines  that  do  not  contain  a  word 
that  was  used  as  a  verb  in  the  story,  and  allowing  mor¬ 
phological  variation  only  on  words  that  were  used  as 
verbs  in  the  story.  HMM  Hedge  was  trained  on  700,000 
news  articles  and  headlines  from  the  TIPSTER  corpus. 

5.2  Bleu:  Automatic  Evaluation 

BLEU  (Papineni  et  al,  2002)  is  a  system  for  auto¬ 
matic  evaluation  of  machine  translation.  BLEU  uses  a 
modified  n-gram  precision  measure  to  compare  machine 
translations  to  reference  human  translations.  We  treat 
summarization  as  a  type  of  translation  from  a  verbose 
language  to  a  concise  one,  and  compare  automatically 
generated  headlines  to  human  generated  headlines. 

Eor  this  evaluation  we  used  100  headlines  created 
for  100  AP  stories  from  the  TIPSTER  collection  for 
August  6,  1990  as  reference  summarizations  for  those 
stories.  These  100  stories  had  never  been  run  through 
either  system  or  evaluated  by  the  authors  prior  to  this 
evaluation.  We  also  used  the  2496  manual  abstracts  for 
the  DUC2003  10- word  summarization  task  as  reference 
translations  for  the  624  test  documents  of  that  task.  We 
used  two  variants  of  HMM  Hedge,  one  which  selects 
headline  words  from  the  first  60  words  of  the  story,  and 
one  which  selects  words  from  the  first  sentence  of  the 
story.  Table  1  shows  the  BLEU  score  using  trigrams, 
and  the  95%  confidence  interval  for  the  score. 


AP900806 

DUC2003 

HMM60 

0.0997  +  0.0322 
avg  len:  8.62 

0.1050  +  0.0154 
avg  len:  8.54 

HMM  1  Sent 

0.0998  +  0.0354 
avg  len:  8.78 

0.1115  +  0.0173 
avg  len:  8.95 

HedgeTr 

0.1067  +  0.0301 
avg  len:  8.27 

0.1341+0.0181 
avg  len:  8.50 

Table  1 


These  results  show  that  although  Hedge  Trimmer 
scores  slightly  higher  than  HMM  Hedge  on  both  data 
sets,  the  results  are  not  statistically  significant.  How¬ 
ever,  we  believe  that  the  difference  in  the  quality  of  the 
systems  is  not  adequately  reflected  by  this  automatic 
evaluation. 

5.3  Human  Evaluation 

Human  evaluation  indicates  significantly  higher 
scores  than  might  be  guessed  from  the  automatic 
evaluation.  Eor  the  100  AP  stories  from  the  TIPSTER 
corpus  for  August  6,  1990,  the  output  of  Hedge  Trim- 


mer  and  HMM  Hedge  was  evaluated  by  one  human. 
Each  headline  was  given  a  subjective  score  from  1  to  5, 
with  1  being  the  worst  and  5  being  the  best.  The  aver¬ 
age  score  of  HMM  Hedge  was  3.01  with  standard  devia¬ 
tion  of  1.11.  The  average  score  of  Hedge  Trimmer  was 
3.72  with  standard  deviation  of  1.26.  Using  a  t-score, 
the  difference  is  significant  with  greater  than  99.9% 
confidence. 

The  types  of  problems  exhibited  by  the  two  systems  are 
qualitatively  different.  The  probabilistic  system  is  more 
likely  to  produce  an  ungrammatical  result  or  omit  a  nec¬ 
essary  argument,  as  in  the  examples  below. 

(15) HMM60:  Nearly  drowns  in  satisfactory  condi¬ 

tion  satisfactory  condition. 

(16) HMM60:  A  county  jail  inmate  who  noticed. 

In  contrast,  the  parser-based  system  is  more  likely 
to  fail  by  producing  a  grammatical  but  semantically 
useless  headline. 

(17) HedgeTr:  It  may  not  be  everyone’s  idea  espe¬ 

cially  coming  on  heels. 

Finally,  even  when  both  systems  produce  accept¬ 
able  output.  Hedge  Trimmer  usually  produces  headlines 
which  are  more  fluent  or  include  more  useful  informa¬ 
tion. 

(18)  a.  HMM60:  New  Year’s  eve  capsizing 

b.  HedgeTr:  Sightseeing  cruise  boat  capsized 
and  sank. 

(19)  a.  HMM60:  hundreds  of  Tibetan  students 
demonstrate  in  Lhasa. 

b.  HedgeTr:  Hundreds  demonstrated  in  Lhasa 
demanding  that  Chinese  authorities  respect  cul¬ 
ture. 

6  Conclusions  and  Future  Work 

We  have  shown  the  effectiveness  of  constructing 
headlines  by  selecting  words  in  order  from  a  newspaper 
story.  The  practice  of  selecting  words  from  the  early 
part  of  the  document  has  been  justified  by  analyzing  the 
behavior  of  humans  doing  the  task,  and  by  automatic 
evaluation  of  a  system  operating  on  a  similar  principle. 

We  have  compared  two  systems  that  use  this  basic 
technique,  one  taking  a  statistical  approach  and  the 
other  a  linguistic  approach.  The  results  of  the  linguisti¬ 
cally  motivated  approach  show  that  we  can  build  a 
working  system  with  minimal  linguistic  knowledge  and 
circumvent  the  need  for  large  amounts  of  training  data. 
We  should  be  able  to  quickly  produce  a  comparable 
system  for  other  languages,  especially  in  light  of  current 


multi-lingual  initiatives  that  include  automatic  parser 
induction  for  new  languages,  e.g.  the  TIDES  initiative. 

We  plan  to  enhance  Hedge  Trimmer  by  using  a 
language  model  of  Headlinese,  the  language  of  newspa¬ 
per  headlines  (Mardh  1980)  to  guide  the  system  in 
which  constituents  to  remove.  We  Also  we  plan  to  al¬ 
low  for  morphological  variation  in  verbs  to  produce  the 
present  tense  headlines  typical  of  Headlinese. 

Hedge  Trimmer  will  be  installed  in  a  translingual 
detection  system  for  enhanced  display  of  document  sur¬ 
rogates  for  cross-language  question  answering.  This 
system  will  be  evaluated  in  upcoming  iCLEF  confer¬ 
ences. 
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